Sources of Success for Information Extraction Methods

Authors

  • David Kauchak
  • Joseph Smarr
  • Charles Elkan
Abstract

In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI), by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call SWI ratio. We show that this measure is a good predictor of IE success. Based on these results, we analyze the strengths and limitations of current rule-based IE methods in general. Specifically, we explain limitations in the information made available to these methods, and in the representations they use. We also discuss how confidence values returned during extraction are not true probabilities. In this analysis, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can improve the regularity of and hence performance on natural text extraction tasks. We conclude with proposals for enriching the representational power of rule-based IE methods to exploit these and other types of regularities.
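
To make the reweighting idea in the abstract concrete, here is a minimal sketch of one round of generic discrete AdaBoost reweighting. It is an illustration under stated assumptions, not the exact procedure used by BWI, which the abstract does not specify. The function name `adaboost_reweight` and the toy weights are hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """One round of generic discrete AdaBoost reweighting (illustrative only).

    weights: current example weights, assumed to sum to 1
    correct: booleans, True where the newly learned rule handled the example correctly
    Returns (alpha, new_weights).
    """
    # Weighted error of the newly learned rule.
    err = sum(w for w, c in zip(weights, correct) if not c)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    # Vote strength of the rule: lower error gives a larger alpha.
    alpha = 0.5 * math.log((1.0 - err) / err)

    # Shrink weights of correctly handled examples, grow weights of mistakes,
    # then renormalize so the weights sum to 1 again.
    raw = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Toy data (hypothetical): four equally weighted examples, one handled incorrectly.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, correct)
```

After the update, the missed example carries more weight than before, so the next rule is pushed toward covering it. This is the sense in which reweighting lets a learner keep adding specific rules rather than stopping once every positive example has been covered once.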

Publication date: 2002